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Abstract 

The knowledge of the Free Energy Landscape topology is the essential key to understand many biochem- 
ical processes. The determination of the conformers of a protein and their basins of attraction takes a 
central role for studying molecular isomerization reactions. In this work, we present a novel framework 
to unveil the features of a Free Energy Landscape answering questions such as how many mcta-stable 
conformers are, how the hierarchical relationship among them is, or what the structure and kinetics of the 
transition paths are. Exploring the landscape by molecular dynamics simulations, the microscopic data 
of the trajectory arc encoded into a Conformational Markov Network. The structure of this graph reveals 
the regions of the conformational space corresponding to the basins of attraction. In addition, handling 
the Conformational Markov Network, relevant kinetic magnitudes as dwell times and rate constants, or 
hierarchical relationships among basins, complete the global picture of the landscape. We show the power 
of the analysis studying a toy model of a funnel-like potential and computing efhciently the conformers 
of a short peptide, the dialanine, paving the way to a systematic study of the Free Energy Landscape in 
large peptides. 

Author Summary 

A complete description of complex polymers, such as proteins, includes information about their structure 
and their dynamics. In particular it is of utmost importance to answer the following questions: What 
arc the structural conformations possible? Is there any relevant hierarchy among these conformers? 
What are the transition paths between them? These and other questions can be addressed by analyzing 
in an efficient way the Free Energy Landscape of the system. With this knowledge, several problems 
about biomolecular reactions (such as enzymatic activity, protein folding, protein deposition diseases, 
etc) can be tackled. In this article we show how to efficiently describe the Free Energy Landscape 
for small and large peptides. By mapping the trajectories of molecular dynamics simulations into a 
graph (the Conformational Markov Network), and unveiling its structural organization, we obtain a 
coarse grained description of the protein dynamics across the Free Energy Landscape in terms of the 
relevant kinetic magnitudes of the system. Therefore, we show the way to bridge the gap between the 
microscopic dynamics to the macroscopic kinetics by means of a mesoscopic description of the associated 
Conformational Markov Network. Along this path the compromise between the physical nature of the 
process and the magnitudes that characterize the network is carefully kept to assure the reliability of the 
results shown. 
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Introduction 

Polymers and, more specifically, proteins, show complex behavior at the cellular system level, e.g. in 
protein-protein interaction networks [1], and also at the individual level, where proteins show a large 
degree of multistability: a single protein can fold in different conformational states [2-4]. As a complex 
system [5,6], the dynamics of a protein cannot be understood by studying its parts in isolation, instead, 
the system must be analyzed as a whole. Tools able to represent and handle the information of the entire 
picture of a complex system are thus necessary. 

Complex network theory [7,8] has proved to be a powerful tool used in seemingly different biologically- 
related fields such as the study of metabolic reactions, ecological and food webs, genetic regulatory 
systems and the study of protein dynamics [7]. In this latter context, diverse studies have analyzed 
the conformational space of polymers and proteins making use of network representations [9-12], where 
nodes account of polymer conformations. Additionally, some studies have tried to determine the common 
and general properties of these conformational networks [13,14] looking at magnitudes such as clustering 
coefficient, cyclomatic number, connectivity, etc. Recently, trying to decompose the network in modules 
corresponding to the free energy basins, the use of community algorithms over these conformational 
networks have been proposed [15]. Although this approach has opened a promising path for the analysis 
of Free Energy Landscapes (FEL), the community based description of the network leads to multiple 
characterizations of the FEL and thus it is difficult to establish a clear map from the communities found 
to the basins of the FEL. 

A similar approach, commonly used to analyze the complex dynamics, is the construction of Markovian 
models. Markovian state models let us treat the information of one or several trajectories of molecular 
dynamics (MD) as a set of conformations with certain transition probabilities among them [9,16,17]. 
Therefore, the time-continuous trajectory turns into a transition matrix, offering global observables as 
relaxation times and modes. In [16-18] the use of Markovian models is proposed with the aim of detecting 
FEL meta-stable states. However, the above approaches to analyze FELs of peptides involves extremely 
large computational cost: either general community algorithms or large transition matrices. 

Finally, other strategies to characterize the FEL that have successfully helped to understand the 
physics of biopolymers, are based on the study of the Potential Energy Surface (PES) [3,4, 19-21]. The 
classical transition-state theory [22] allows us to project the behavior of the system at certain temperature 
from the knowledge of the minima and transition states of the PES. This approach entails some feasible 
approximations, such as harmonic approximation to the PES, limit of high damping, assumption of high 
barriers, etc. These approximations could be avoided working directly from the MD data. 

In this article we make a novel study of the FEL capturing its mesoscopic structure and hence 
characterizing conformational states and the transitions between them. Inspired by the approaches 
presented in [12, 15] and [16, 17], we translate a dynamical trajectory obtained by MD simulations into 
a Conformational Markov Network. We show how to efficiently handle the graph to obtain, through 
its topology, the main features of the landscape: conformers and their basins of attraction, dwell times, 
rate constants between the conformational states detected and a coarse-grained picture of the FEL. The 
framework is shown and validated analyzing a synthetic funnel-like potential. After this, the terminally 
blocked alanine peptide (Ace-Ala-Nme) is studied unveiling the main characteristics of its FEL. 

Methods 

In this section we show the round way of the FEL analysis: the map of microscopic data of a MD into 
a Conformational Markov Network (CMN) and, by unveiling its mesoscopic structure, the description of 
the FEL structure in terms of macroscopic observables. 



Exploring the Free Energy Landscape 



3 



Translating the FEL into a network 

First, we encode a trajectory of a stochastic MD simulation into a network: the CMN. This map wiU 
allow us to use the tools introduced henceforth to analyze a specific dynamics of complex systems such 
as biopolymers. 

Conformational Markov Network 

The CMN has been proven to be a useful representation of large stochastic trajectories [10, 11, 15]. This 
coarse grained picture is usually constructed by discretizing the conformational space explored by the 
dynamical system and considering the hops between the different configurations as dictated by the MD 
simulation. In this way, the nodes of a CMN are the subsets of configurations defined by the confor- 
mational space discretization and the links between nodes account for the observed transitions between 
them. The information of the stochastic trajectory allows to assign probabilities for the occupation of 
a node and for the transitions between two different configurations. Defined as above, a CMN is thus a 
weighted and directed graph. 

Our CMN is constructed as follows. The conformational space is divided into N cells of equal volume, 
therefore every node i {i = 1, ■■■,N) of the CMN contains the same number of possible configurations. 
Next, by evolving a stochastic trajectory enough time steps (of length At) to assure the ergodicity 
condition we can define the final CMN set up. We assign to each node a weight, Pi, that accounts 
for the fraction of trajectory that the system has visited any of the configurations contained in node i 
(the following normalization X^i^i = 1 holds). Second, a value Pij is assigned to each directional link, 
accounting for the number of hops from node j to node i. Note that transitions between configurations 
contained in the same node are also considered by Pa, i.e. the network can also contain self-loops. Finally, 
the weights of the outgoing links from a node j, {Pij}, are conveniently normalized so that Pij = 1. 

The CMN constructed in this way, is described by a single matrix S = {Pij} and a vector whose 
components are the occupation probabilities P = {Pi}. Hence, the matrix S is the transition probability 
matrix of the following Markov chain, 

v{t + At) = Svit) (1) 

where v(t) is the instant probability distribution of the system at time t. Since the matrix S is ergodic 
and time invariant, one can compute the stationary distribution associated to the Markov chain, v^, that 
satisfies = Sv^. The latter stationary distribution has to be identical to the computed weights of the 
network nodes. Pi = vf {i — 1, ...,N), provided the stochastic trajectory is long enough. Moreover, the 
detailed balance condition, 

PjiPi = PijPj (2) 

must hold thus relating the elements of matrix S to the stationary probability distribution. Therefore, the 
transition matrix S appears to be the minimal descriptor of the stochastic trajectory and, as consequence, 
of the CMN. 

Markovity 

Provided the MD trajectory is long enough to consider the sample in equilibrium, the weight-distribution 
of nodes in the CMN will be the stationary solution of Eq. (1) and detailed balance condition (2) will 
be fulfilled [23]. However, this property is not enough to consider the model Markovian: Although the 
continuous trajectory will be produced using Langevin dynamics (and therefore inherently Markovian in 
the phase space [24,25]) the discrete representation of the CMN and the integration of momenta defies the 
Markovian character of our model [24,26-28]. Several methods are proposed in the literature to validate 
Markov models [16,27,29]. In order to obtain a reliable description, specially about those magnitudes 
related to the time evolution of the system (see subsection Temporal hierarchy of basins), the time step 
At must be large enough to avoid memory effects [27]. 
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A detailed check and discussion about the Markovian character of the networks shown in this article 
can be found in the Supporting Information Text (SI). 

Funnel-like potential 

To iUustrate the CMN approach and the methods presented below, we introduce here a synthetic potential 
energy function, that serve us as a toy model where results can be easily interpreted. This potential energy 
is reminiscent of that funnel surfaces recurrently found when the FEL of proteins are studied [30,31]. 
In particular, we have considered a two-dimensional system where a particle moves in contact with a 
thermal reservoir and whose potential energy is given by, 

V{x, y) = 2 ^ y'^^ ^ '-"o "1 sin(&ia:) + 02 sin(62a;) 

+a3sin(632/))e-5(-'+?'')' , (3) 

where we have set Gq = 4, oi = 1.6, a2 = 0.8, 03 = 1.2, bi = 6, 62 = 15 and ~ 6. As shown in Figure 
lA the above potential energy confines the movement of the stochastic trajectory inside a finite region 
of the conformational space. However, thermal fluctuations allow the particle to jump between several 
basins of attraction. 




-1 -0.5 0.5 



Figure 1. SSD algorithm applied to a synthetic funnel-like potential. (A) 2D funnel-like 
potential. (B) A stochastic trajectory is translated into a CMN where 6 sets of nodes (corresponding to 
different color) are the result of the SSD algorithm. (C) Recovering the spatial coordinates, the 
stationary probabilities of each node are shown in color code. The 6 basins detected are represented as 
color striped regions. (D) A coarse-grained CMN is built where new nodes take the role of the basins. 
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A stochastic trajectory has been simulated using an overdamped Langevin dynamics and the equations 
of motion have been integrated with a fourth order stochastic Runge-Kutta method [32]. Figure IC shows 
the region of the conformational space visited by the particle. We have conveniently discretizcd the two- 
dimensional space into pixels of equal area and computed their corresponding occupation probabilities. 
Thereby, with the transition probabilities between pixels, the trajectory is represented as the CMN shown 
in Figure IB. The question now is: can we recover the topology of the FEL (derived from Eq. (3)) from 
the CMN representation? 

Analyzing the FEL through the network 

Up to now, we have illustrated the conversion of molecular dynamics data into a graph (the CMN). 
Now, we show how to efficiently obtain the thermo-statistical data from the mesoscopic description of 
the CMN. 

Revealing structure: conformational basins 

Inspired by the deterministic steepest descent algorithm to locate minima in a potential energy surface we 
propose a Stochastic Steepest Descent (SSD) algorithm to define basins on the discretized FEL. Dealing 
with the nodes and links as we describe below, the proper structure of the CMN is unveiled to call the 
modules obtained conformational macro-states or basins. 

Picking at random one node of the CMN, say a, and an initial probability distribution Pi{0) ~ 6i_a {i = 
1, ...,N), the Markov process relaxes according to P{^t) = SP(0). The whole probability concentrated 
in node a at time 0, in a single time step At, evolves driving the maximum amount of probability down 
hill over the FEL. The next node b in the descendent pathway from a is taken by following the link 
that carries maximum probability flux. Focusing again all the probability in node 6, Pi{l) = Si_b we 
continue the pathway from a towards a local FEL minimum by identifying the next node c for which the 
probability current Pc,b is maximal. Iterating this operation for each node of the CMN, we obtain a set 
of disconnected descent pathways that help us to define the basins of attraction. 

We establish formally the above procedure assisted by a vector Q = {uJi} (with i = 1, N) that label 
the nodes: 

(i) We start by assigning H = 0. 

(ii) Select at random a node / with = (i.e., I has not been labeled yet) and start to write an 
auxiliary list V of nodes adding / as the first entry in this list. 

(Hi) Search, within the neighbors of the node I, a node m so that Pmi = max{Pj,i,\/j ^ 1} [33] and 
check which of the following options is fulfilled: 

A If Prni > Pim and cUm = 0: add m to the list V and go again to (Hi) taking m in the place of I. 

B If Pmi > Pim and ujm then write the labels of all the nodes in the list V as ujj = ujm 
Vj € V. The process continues going to step (ii). 

C If Pmi < Pim the link / — » m is removed from the graph. The process returns to step (Hi) with 
the next exception: since this step has been iterated 2D times for the same node I (being D 
the number of coordinates discretized to construct the CMN), / is stated as local minimum 
and uji =1. In this case ujj = I for those nodes j (z V and the process comes back to step (ii). 

The whole procedure ends when no nodes unlabeled remain in the CMN, Ui Vi. The restriction 
introduced in step (iiiC) with the dimensionality D avoids a transition from a local minimal energy 
configuration to any other node of the same basin or to a deeper local minimum of a different basin. 
When every node of the CMN has been visited, the conformational space is completely characterized and 
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wc have thus traced aU the maximum descent pathways from any node to the local FEL minima. Finally, 
all those nodes with the same label uji belong to the same FEL basin and therefore they arc associated 
to the same conformational state. The result of the procedure is the partition of the CMN in a set of 
modules which correspond to basins of attraction of the discrctized conformational space. 

To illustrate the basin decomposition of a CMN, the SSD algorithm has been applied to the funnel-like 
potential. The result is the detection of six basins in agreement with the number of local minima in its 
FEL (Figure IB and IC). 

Comparing with other community algorithms 

With the aim of studying biomolecules and systems with high degree of dimensionality, the way to detect 
these FEL basins must be computationally efficient. The method described above takes a computational 
time 0{2DN), once the 2D largest hooping probabilities Pji are computed for all the nodes in the 
network. Additionally, the method is deterministic providing with a unique partition of the CMN into 
different modules. These two characteristics make this analysis faster and more straightforward than any 
other partitioning method [34]. These advantages come from the knowledge of the physical meaning of 
links and nodes of CMNs. In the SI other community algorithms (Newman's modularity and Markov 
Clustering algorithm) are tested over our toy model system. None of the algorithms reported in the SI 
give a satisfactory result mapping the modules obtained with the free energy basins. 

Coarse- Grained CMN 

To get a more comprehensible representation of the FEL studied, a new CMN network can be built 
by taking the basins as nodes. The occupation probabilities of these nodes as well as the transition 
probabilities among them can be obtained from those of the original CMN as 



where i and j are indexes relative to the nodes of the original CMN and a and p are indexes for the 
basins (new nodes). Note that the new coarse-grained CMN has its weights normalized and fulfills the 
detailed balance condition Eq. (2). Figure ID shows the corresponding coarse-grained for the fimnel-like 
potential. 

The weighted nodes and links have a clear physical meaning [19]. Considering the transition state 
a — > /3 and assuming local " intra-basin" equilibrium, the rate constant of this transition is ka[} — P^a/Ai 
(where At is the time interval between snapshots used to make the original CMN). The relative free 
energy of the basin a, taking basin /3 as reference, is AFa ~ —k'QTlog{Pa/ Pp) ■ Besides, the expected 
waiting time to escape from a to any adjacent basin is Tq = At/{1 — Paa) [35]. Other magnitudes, such 
as first-passage time for inter-basins transitions and other rate constants relaxing the local equilibrium 
condition [19, 35] can also be computed from the original CMN. 

The ability to define the proper regions of the conformational space in an efficient way let us compute 
physical magnitudes of relevance. For instance, the coarse-grained CMN is nothing but a graphical 
representation of a kinetic model with n (the number of basins) coupled differential equations: 




(4) 




(5) 




(6) 



Exploring the Free Energy Landscape 



7 



Free Energy hierarchical basin organization 

The first hierarchy aims to answer the foUowing question: What is the structure of the CMN when nodes 
with lower weight than a certain threshold are removed together with their links? Let us take the control 
parameter F/k-QT as the threshold to restrict the existence of the nodes in the original CMN. Where 
Fi/k-QT = log{Pyj) — log{Pi) is the "adimensional free energy" of a node i relative to the most weighted 
node w. 

With the above definitions we start a CMN reconstruction by smoothly increasing the threshold from 
its zero value. At each step of this process, we obtain a network composed of those nodes with free energy 
lower than the current threshold value. As the free-energy cut-off increases, new nodes emerge together 
with their links. These new nodes may be attached to any of the nodes already present in the network 
or they can emerge as a disconnected component. At a certain value of F/kBT, some components of the 
network become connected by the links of a new node incorporated at this step. A Hoshcn-Kopelman 
like algorithm [36] is used to detect the disconnected components of the network at each value of the 
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Figure 2. Hierarchies of the basins detected for the funnel-like potential, (a) Free Energy 
hierarchy: based on the relative free-energy of the nodes, (b) Temporal hierarchy: number of basins 
defined by SSD for the different networks built by eq. (7). The original basins merge in function of 
time. Both hierarchies reveals a coarse-grained behavior of two macro-states: {aUbU c) and (d U e U /). 
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threshold used: from zero until all the N nodes of the CMN were already attached. 

This bottom-up network reconstruction provides us with a hierarchical emergence of nodes along with 
the way they join together. This picture can be better described by a process of basins emergence and 
linking that is easily represented by means of a basin dendogram. This representation let us guess at 
first glance the hierarchical relationship of the conformational macro-states and the height of the barriers 
between them. Let us remark that the transition times cannot be deduced from these qualitative barriers 
since the entropic contribution or the volume of the basin are not reflected in this diagram. The basins 
family-tree obtained for the funnel-like (see Figure 2 A) reveals that, despite of having a two-dimensional 
potential with the shape of a funnel, one cannot describe it as a sequence of metastable conformations 
that drive the system to the global minimum. Moreover, the diagram shows a roughly similar behavior 
as for a double asymmetric well, composed by two sets of basins: (a U 6 U c) and (d U e U /). 

Temporal hierarchy of basins 

The CMN representation of a MD simulation provides with another hierarchical relationship that is 
meaningful to understand the behavior of the biological systems. The links of the original CMN have 
been weighted according to the stochastic matrix S = {Pij}- Taking into account the Markovian character 
of the process, we can make use of the Chapman-Kolmogorov equation to generate new transition matrices 
at times r = 2 At, 3 At, etc... Formally, the Markov chain at sampling time niAt is defined by the matrix: 

S(mAi) = [S(At)]" . (7) 

For each value of m a new CMN is defined. This family of CMNs have different weighted links but 
the same weights Pi for the nodes as the original one (m = 1). It is worth to discuss the behavior of the 
matrix S{mAt). In the limit m ^ oo we have P = S{niAt)P{0) , independently of the initial state P(0). 
This means that any node is connected to a given node of the network with the same weight, regardless 
of the initial source. Therefore, only one basin would be detected by the SSD algorithm since every node 
is connected with the most weighted link to the most weighted node in only one step. 

From the original Ai-description of the FEL to the integrated {mAt — > oo) one, we can devise another 
algorithm to establish a second hierarchy of the basins by performing the next two operations: First, 
for each value of m a new CMN is generated by constructing the matrix [S'(At)]™ and second, the SSD 
algorithm is applied to this new CMN. The process finishes when only a basin (the whole network) is 
detected (for large enough values of m). By using this technique we can observe how basins merge with 
others at different time scales (labeled by the integer m). 

The result of this procedure performed for the funnel- like potential is shown in Figure 2B. At r « 500 
only two basins are observed: (a U 6 U c) and (d U e U /), being the largest plateau observed for any of the 
nontrivial arrangement of basins found. Therefore, the macroscopic description in time is in agreement 
with the Free Energy hierarchy described previously. It is clear that the number of basins decrease as m 
increases. One should be aware that the concept of basin depends dramatically on the time resolution 
at which the CMN is built, and this time limits also the resolution in the FEL structure. Note also that 
this procedure provides with useful information similar to the structural decorrelation time [37]. 



Results 

The Alanine dipeptide. 

The alanine dipeptide, or terminally blocked alanine peptide (Ace-Ala-Nme, Figure 3A), is the most 
simple "biological molecule" that exhibits the common features shown by larger biomolecules. Despite 
of its simplicity, this system has more than one long-life conformational state with different transition 
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pathways. Since the first attempt by Rossky and Karplus [38] to model this dipeptide solvatcd, this 
system has been widely studied in theoretical works [39-42]. The alanine dipeptide has been also the 
appropriate molecule to test tools to explore the FEL [15, 16, 43] and, specifically, to study reaction 
coordinates [42,44]. 

The alanine dipeptide has two slow degrees of freedom, the rotatable bonds (f> (C-N-Cq-C) and (N- 
Cq-C-N) (sec Figure 3A). The FEL projected onto these dihedral angles let us identify the conformational 
states that characterize the geometry of biopolymers, namely: alpha helix right-handed (q!_r), alpha 
helix left-handed (ai,), beta strands (C^egjC^), etc. The number of local minima in the (0, ip) space 
depends on the effective potential model used to simulate the system. Up to date, electronic structure 
calculations have identified a total of nine different conformers [45] . Regarding MD simulations different 
conformational states have been observed: (i) using classical force fields with explicit solvent up to six 
conformers are detected [16,39,40], (ii) at least four stable states by using implicit solvent [15,39,41], 
and (in) two stable conformers in vacuum conditions [39,42]. On the other hand, since the angles (/) 




-180 -90 90 180 

o 

Figure 3. Free energy basins of the Alanine dipeptide. (A) The dialanine dipeptide with the 
angles (j> and ip. (B) Plot of the CMN generated. The 6 sets of nodes (corresponding to different colors) 
are the result of the SSD algorithm. (C) Left: Ramachandran plot with the probability of occupation of 
the cells used to build the CMN. The boundaries of the free energy basins are shown in white. Right: 
the 6 basins represented as regions of different color. (Color code: orange = or, red = q;l, yellow = 
Cjax, pink = Cjeq, green = C5 and blue = a'). 
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and ■0 seem appropriate to distinguish the mctastable states, the kinetics between them is not accurately 
described with this choice of reaction coordinates, the solvent coordinates and/or other internal degrees 
of freedom must be taken into account [42,44]. 

We have used SSD algorithm to detect the local minima and their corresponding basins for this 
molecule in the (j)-ip space. For this purpose, a Langevin MD simulation of 250 ns has been performed at 
a temperature of 400 K (see the 5"/ for further details). Additionally the CMN has been built dividing 
the ((/), V') Ramachandran plot into cells of surface 9° x 9° (40 x 40) and taking dialaninc conformations 
at time intervals of At = 0.01 ps. The resulting CMN have a total of n = 1505 nodes and e = 26324 
directed links. 

The SSD algorithm applied to the CMN network reveals 6 basins. Figure 3B shows the resulting net- 
work where nodes belonging to the same basin take the same color. Bringing back this information to the 
Ramachandran map, these 6 sets of nodes define 6 regions represented in Figure 3C. To better illustrate 
this division, other representation, where each region has a different color, is shown. By comparing with 
previous studies on this molecule, we identify the regions in orange, red, yellow and pink with conformers 
Q!_R, ctL, Ciax and C^eq respectively [39,46]. Besides, region green corresponds to conformer C5 and the 
blue one to a' [16,46]. Remarkably, the basin (one of the less populated state) has been visited 1155 
times with a mean stay time of 4.20 ps. 

We now look at the coarse-grained picture of the FEL by describing the properties of the 6 basins 
detected. The different weights of the basins are related to the free energy of the corresponding confor- 
mational macro-states. In Table 1 these energy differences AFa = —k-BTlog{Pa/ Pct^c,) are shown, taking 
the most populated basin as the energy reference Fcj^^ = 0. The lowest free energy basins correspond 
to configurations with < 0° {Cjeq, ctR, C5 and a'), whereas the two other conformers located in the 
region (f> > 0° have the highest free energy but the largest dwell time. Moreover, we have also analyzed 
the trapping efficiency of each basin by reporting the mean escape time (At/(1 — Paa)) as well in Table 
1. 



Table 1. Relative free energies and Mean Escape Time of the basins defined by SSD. 

Basin Fi — Fc-,^^ kcal/mol Mean Escape Time (ps) 



C7eq 0.00 0.52 

an 0.45 0.42 

ul 2.42 4.20 

C7ax 3.84 0.71 

C5 0.55 0.28 

a' 0.90 0.23 



Table 2. Characteristic times for direct inter-basins transitions. 
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The FEL can be represented as a dendogram, see Figure 4, where the hierarchical map of the con- 
formers based on Free Energy gives at first glance a global picture of the landscape. Remarkably, the 
conformer , despite of having one of the highest free energy, looks like the metastable state with longest 
life. This result is supported by the values of Mean Escape Time shown in Table 1. 



10 



s 



6 



C7e 



-7ax 



Figure 4. Dendogram based on the relative Free Energy of the CMN nodes. Two sets of 
basins are clearly distinguished with a high free energy barrier in between: {Cyeq, «_r, Cq, a') and 
(CyaxjCtL)- Notc that Q!L looks like the conformer with the largest dwell time, in agreement with data in 
Table 1. 
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"l \ ' \ 1 
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Figure 5. Dendogram based on the temporal hierarchy of basins. In around 100 ps the 
peptide finds the way to reach the global minimum, conformer C^eq, from any basin. 



Exploring the Free Energy Landscape 



12 



The alanine dipeptide has been also studied because of its "fast" isomerization Cjeq and a "slow" 

transition an — > Cyax ■ Our coarse-grained picture of the FEL also allows us to extract information about 
these transitions. In Table 2 we show some of the relevant characteristic transition times from a basin 
a to an adjacent basin 6, 1/fcba- [The whole data is shown in the SI.] Transitions between basins with 
the same sign of are remarkably faster (e.g Cieq ^ ck_r and ^ C^ax)- While slow transitions 
are observed for those hops crossing the line = 0° (Cveg ^ and aji — > C^ax), showing them as 
rare events. Instead, the alanine dipeptide finds more easily paths to go to > 0° conformers through 
Cs Cjax and a' ^ ul by crossing </> = 180°. 

To round off the description of the FEL, the dendogram corresponding to the temporal hierarchy is 
shown in Figure 5. From the figure, it becomes clear that the behavior of the dialanine depends on the 
time scale used for its observation. Again, the same two different sets of conformers are distinguished 
from this hierarchy. Additionally, the global minimum conformer is reached in around 100 ps from any 
basin. 

Finally, the magnitudes computed here for the alanine dipeptide would allow to construct a first- 
order kinetic model of 6 coupled differential equations as Eq. (6) (assuming equilibrium intra-basin) . 
This model contains the same information as the kinetic model by Chekmarev et al. for the irreversible 
transfer of population from aji — > C-jax [41]. 

Discussion 

Hierarchical landscapes characterize the dynamical behavior of proteins, which in turn depends on the 
relation between the topology of the basins, their transitions paths and the kinetics over energy barriers. 
The CMN analysis of trajectories generated by MD simulations is a powerful tool to explore complex 
FELs. 

In this article, we have proposed how to deal with a CMN to unveil the structure of the FEL in a 
straightforward way and with a remarkable efficiency. The analysis presented here is based on the physical 
concept of basin of attraction, making possible the study of the conformational structure of peptides and 
the complete characterization of its kinetics. Note that this has been done without the estimation of the 
volume of each conformational macro-state in the coordinates space and without the 'a priori' knowledge 
of the saddle points or the transition paths from a local minimum to another. 

On the other hand, the framework introduced in the article provides us with a quantitative description 
of the dialanine's FEL, coming up directly from a MD dynamics at certain temperature. The peptide 
explores its landscape building the corresponding CMN and the success of extracting the relevant infor- 
mation is up to the ability of dealing with it. Neither the FE basins were defined by the unique criterion 
of clustering conformations with a geometrical distance [47], nor the rate constants were projected from 
the potential energy surface [19,20]. Moreover, the conformers and their properties were computed from 
the MD with the only limitation of the discretization of time and space. 

Although we have applied the method to low dimensional landscapes, we expect that high dimensional 
systems could be also studied, by the combination of this technique with the usual methods to reduce 
the effective degrees of freedom (like principal component analysis or essential dynamics). In conclusion, 
the large amount of information obtained by working with the CMN. its potential application to any 
peptide with a large number of monomers, and the possibility of performing the analysis on top of CMN 
constructed via several short MD simulations [48] , make the approach presented here a promising way to 
describe the FEL of a protein. 
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Supporting Information 



1 Checking Markovity 

Although the purpose of the work introduced in the manuscript was to iUustrate a way to handle the 
Conformational Markov Network (CMN) in order to obtain the basins of attraction and characterize 
the Free Energy Landscape, the lector must be advise about the necessity of checking the markovian 
character of the model (network in our case). 



With regard to the CMNs that appear in the manuscript: 



(1) The particle in the funnel like potential is simulated using an overdamped dynamics. In this 

case, the continuous trajectory integrated is inherently Markovian [1,2]. When we discretize the co- 
ordinate space to lump the trajectory into micro-states, the Markovity of the new description can be 
defied [3,4]. Therefore, one has to care about the time step to describe the process since it must be 
larger than the longest equilibration time among the micro-states [5]. In ref. [6] three approaches are 
mentioned to evaluate the degree of Markovity of a stochastic model. Among these approaches, we have 
taken the criterion presented by Park and Pandc in [7] based on Shannon's entropy. This method pro- 
vides a unique magnitude not as sensitive to statistical and numerical noise as other methods. Another 
reason to make this choice is that a necessary and sufficient condition for Markovity is checked (while 
the observation of the eigenvalues of the transition matrix is not a sufficient condition). The analysis 
is based on the comparison of a first-order conditional entropy _ffT-(X„|X„_i) and the second-order one 
Hr{Xn\Xn-i, Xn-2) (sce [7] for further details and notation). The magnitude Rr quantifies, given X„_i, 
what fraction of the information in Xn is the mutual information between X„ and Xn-2- Although the 
method is computationally expensive, our models are small enough to carry out this analysis. The Figure 
6 reveals that with the lag time used in our analysis (r = 1) the memory effect (non-Mar kovity) is less 
than a 5%. 

We have to remark, as it is discussed in ref. [7], that there's probably no answer to the question: when 
does the model behaves Markovian? However the appropriate question should be: Given an observable, 
what is the degree of Markovity needed to have a correct measure? Regarding to our work, note that 
the analysis of the basins detection depends on the detailed balance condition, but not on the relaxation 
times. On the other hand, a small deviation such as the 5%, only affects at the analysis of the temporal 
hierarchy of basins, where Chapman-Kolmogorov is used explicitly. In order to distinguish whether the 
error is propagated (and increased) or not when the transition matrix is raised to the power of n, we 
compute the KuUback-Leibler divergence between the transition matrix 5(t)" and the matrix computed 
from the trajectory S{nT): 

DKL{S{nr)\\S{rr) = Y,Sinr),,P,\n§P^ 

The Figure 7 reveals that the error in the transitions computed with Chapman-Kolmogorov are far 
from the experimental values during a short range of nr close to the original lag time, but the diver- 
gence decrease with time after few steps. In the limit n — > oo, the matrix obtained must be equal to the 
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Figure 6. Checking Markovity: The relative mutual information Rr quantifies the degree of 
non-Markovity of our stochastic model. The lag time used to construct the CMN -funnel like potential-, 
T = 1. reveals a memory effect lower than a 5%. 
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Figure 7. The KuUback-Leibler divergence between the "experimental" transition matrix S{nT) and 
S'(t)" decreases after few time steps (t). 



Exploring the Free Energy Landscape 



18 



experimental one (having the stationary probability distribution in each column of the transition matrix). 

(ii) The alanine dipeptide is simulated with the Langevin formalism (see details in the manuscript 

and in the 5*/). The continuum trajectory in the space of coordinates and momenta is also inherently 
Markovian [1,2]. On the other hand, we are integrating momenta and discretizing the coordinates space 
in our description, which can result in a non-Markovian chain depending on the time step used. In our 
case the same analysis applied before provides a value of 0.8% of memory effect. 

2 Comparing with other community algorithms 

As it was introduced in the paper, one may be tempted to apply standard algorithms for community 
detection in complex networks to the problem of unveiling the topology of free energy landscapes. Sev- 
eral authors have made use of current community detection tools [8-13] with diverse degree of success. 
However, there is no clear conclusion about what method is the most suitable for the problem [14]. Here 
we sketch the results obtained using two different approaches [12,15] to the community detection that 
have already been used in previous related works. These results show that our method outperforms any 
of the proposed algorithms both in the computational complexity and the success in relating the CMN 
structure to the FEL. 

2.1 Maximization of Modularity 

A number of community detection algorithms have been introduced in the last years making use of the 
idea of modularity [11, 13]. The idea is to find a network partition into disconnected clusters so that a 
given function, the modularity of the network partition, is maximal. The modularity of a given partition 
accounts for the probability of having edges connecting nodes belonging to the same cluster in the network 
minus the expected probability in a randomized version of the network (having the same number of nodes 
and edges and preserving the strength of the nodes). Therefore, modularity provide a way to quantify 
the quality of a given network partition, i.e. the larger the modularity the better the partitioning is. 

The modularity, Q, for weighted and directed networks has been introduced in ref. [15,16] and, applied 
to our CMN, it takes the following form: 

Q = Y,[P^PJ^-P^PMc^,0,) (8) 

where the function 5{x,y) is the Kronecker delta function that takes value 1 if a; = y and otherwise. 
The label c; accounts for the community where node i is assigned in the network partition. 

A number of optimization algorithms have been used to look for maximal modularity partitions. 
We have chosen the spectral decomposition introduced in ref. [15] and a deterministic greedy search 
algorithm [17,18] to compare with the results obtained for the funnel-like CMN using the SSD algorithm. 
With these two methods we have obtained two optimal partitions of the CMN having Qmax = 0.54 and 
Qmax = 0.53 respectively. On the other hand, the modularity value for the basins detected by SSD only 
reaches a value Q = 0.116. This would seem a bad test for the SSD, however, as it is also shown in 
ref. [14], the community structure obtained by modularity-based methods is far from the basins shown 
in Figure IC of main text. The most approximated decomposition is obtained by the greedy search 
algorithm where 11 communities were detected. In this case the most populated basin detected for the 
funnel- like potential, the basin called a in Figure IG of main text, was split in 5 different communities. 

This check was not so unwise if one observes that Q is related with the relaxation times of a stochastic 
Markov chain represented by the matrix S in the article: 
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(9) 



where a is the community index, i is the node index, and Tk and ^ are the eigenvahies -relaxation 
times- and eigenvectors -relaxation modes- of the stochastic matrix M = S — 1. 

Despite of the possible interpretation of communities based on modularity in CMNs, the Q value 
obtained for the SSD basins is quite farway from being the maximum. 

2.2 Markov Clustering algorithm 

Alternatively to the use of modularity optimization approaches, a second type of algorithms based on 
random walks has been introduced to infer the structure of complex networks. The most interesting 
algorithm of this type is the Markov Clustering (MCL) approach [12] since it has been proved to be 
the best suited method for detecting basins in CMNs [14]. MCL starts with the stochastic matrix S 
and itcratively alternates sequences of two matrix operations (namely expansion and inflation) until 
the convergence to an invariant matrix (from which network partition is finally obtained) is achieved. 
Expansion and inflation transformations correspond to matrix squaring and matrix Hadamard p-power 
respectively. These two operations are followed to a normalization step to assure that the resulting matrix 
is stochastic at the end of each iteration. The detected network partition depends crucially on the choice 
of the parameter p > 1, often called granularity. For its minimum value p = 1 only one community is 
detected, the whole network. Therefore, the success on finding the right network community structure is 
achieved by a fine tuning of p. 
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Figure 8. Markov Clustering algorithm applied to the funnel-like potential. The figure shows the 
number of network clusters found by MCL as a function of the granularity parameter p. As it is shown 
there is a range of granularity values for which MCL is able to detect the same number of clusters as 
the SSD algorithm. 



We have performed the MCL algorithm over the CMN of the funnel-like network, see Figure 8. 
Unlike modularity-based algorithms, the MCL analysis points out that the network is divided into six 
communities (basins) for a small range of p, 1.28 < p > 1.35 (1.5 < p < 2.0 is recommended in the 
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literature), in agreement with the partition found using the SSD algorithm (see Table 3). However, the 
non-deterministic character of the method would tempt, if we do not have prior knowledge about the 
structure of the FEL, to evaluate the quality of the partitions by computing their modularity. This would 
lead us to a CMN partition in disagreement with the number of basins present in the FEL. Therefore, 
the free parameter p makes this algorithm unsuitable for analyzing general CMNs since one should have 
enough prior knowledge about the system to distinguish the best p value. 



Table 3. The number of nodes of SSD basins are compared with those of the MCL modules (p = 1.3) 
for the funnel-like potential. 
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Finally, let us remark that both modularity optimization and MCL are computationally far more 
complex than the SSD algorithm. In particular, the modularity optimization scales as 0{N^) for the 
spectral decomposition and as 0{mN) when greedy algorithm is implemented. On the other hand, 
MCL has a time complexity 0{N^). To our knowledge the unique community detection algorithm that 
scales linearly in time (specifically as 0{L + N)) (with L being the number of links in the network) 
was proposed in ref. [19]. However, this latter method assumes a prior knowledge of the number of 
network clusters and, moreover, these cluster have to be of equal size. As a conclusion, the comparison 
between SSD algorithm and standard community detection [20] methods yields a positive assessment on 
the convenience of using SSD based both on its better performance and linear time complexity. This 
makes the use of the "Stochastic steepest descent" technique the most appropriate method for analyzing 
molecular dynamics data from systems of many degrees of freedom such as proteins. 



3 Alanine dipeptide 

3.1 Molecular dynamics simulation 

The terminally blocked alanine peptide Ace-Ala-Nme has been modeled by the OPLS force field [21] 
with the program Gromacs 3.3.1 [22] and solvated with TIP4P waters [23]. A Langevin MD simulation 
has been performed at 400K with a friction coefficient of 5 ps^^. In order to set up the CMN, a single 
trajectory of 250 ns has been analyzed with conformations saved every 0.01 ps (5 MD steps). 

3.2 Direct inter-basins transitions 

An extension of Table 2 of main text is shown with the whole data for the direct transitions. 
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Table 4. Characteristic times for direct transition inter-basins. 
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